The MNIST dataset consists of \(28\times 28\) images of hand-written digits. Download it from http://yann.lecun.com/exdb/mnist/ (you only really need the training images and labels). To read these images in R, use the script from https://gist.github.com/brendano/39760. Make sure you understand this script. Note that the \(\mathrm{show\_digit}\) command displays a particular digit.
Since the dataset is quite large, restrict yourself to the first 1000 training images, and their labels. Store these as variables called \(\mathrm{digits}\) and \(\mathrm{labels}\). \(\mathrm{digits}\) should be a \(1000\times 784\) matrix (or its transpose). Include R code.
Write a function \(\mathrm{my\_kmeans}\) to perform a k-means clustering of the 1000 images of digits. Use Euclidean distance as your distance measure between images (which can be viewed as vectors in a 784-dimensional space). Your function should take 3 arguments: the matrix \(\mathrm{digits}\), the number of clusters \(K\), and the number of initializations \(N\). Your code should consist of 3 nested loops. The outermost (from \(1\) to \(N\)) cycles over random cluster initializations (i.e. you will call k-means \(N\) times with different initializations). The second loop (this could be a for or while loop) is the actual k-means algorithm for that initialization, and cycles over the iterations of k-means. Each iteration can have 2 successive loops from 1 to \(K\): the first assigns observations to each cluster and the second recalculates the means of each cluster. These should not require further loops. (You will probably encounter empty clusters. It is possible to deal with these in clever ways, but here it is sufficient to assign empty clusters a random mean, just as you initialized them.) Since your initializations are random, make your results repeatable by using the \(\mathrm{set.seed()}\) command at the beginning (you can also make the seed value a fourth argument). Your function should return:
Do not hardcode the number of images or their size. Include R code.
Explain briefly what stopping criteria you used (i.e. the details of the second loop).
Run your code on the 1000 digits for \(K = 5, 10, 20\). Set \(N\) to a largish number, e.g. 25 (if this takes too long, use a smaller number). For each setting of \(K\), plot the cluster means (using \(\mathrm{show\_digit}\)) as well as the evolution of the loss function for the best solution (you can use a semi-log plot if that is clearer). You do not have to print the other values returned by the function, e.g. the cluster assignments or the values of the cluster means; the plots are sufficient.
For each setting of \(K\), plot the distribution of terminal loss function values (using \(\mathrm{ggplot}\)’s \(\mathrm{geom\_density}()\)).
Explain briefly how you might choose the number of clusters \(K\).
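One common heuristic is the elbow method: run the clustering for a range of \(K\), plot the best terminal loss against \(K\), and pick the point where adding clusters stops paying off. A minimal sketch of the idea on synthetic data, using base R's \(\mathrm{kmeans}\) as a stand-in for \(\mathrm{my\_kmeans}\) so that it runs standalone:

```r
# Elbow-method sketch on synthetic 2-D data with 3 true clusters.
# stats::kmeans is used here only so the snippet is self-contained.
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 5),  ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))
Ks <- 1:8
losses <- sapply(Ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
plot(Ks, losses, type = "b", xlab = "K", ylab = "total within-cluster SS")
# The loss drops sharply up to K = 3 (the true number of clusters)
# and flattens afterwards -- the "elbow".
```

For the digits, the same plot over the \(K\) values tried would suggest where extra clusters stop reducing the loss appreciably.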
Modify your code to do k-medoids. You only need to change one part of your previous code, viz. the part that calculates the cluster prototype given cluster assignments. The cleanest way to do this is to define a function called \(\mathrm{get\_prototype}\) that takes a set of observations and returns the prototype. For k-means this function just returns the mean of the observations. Note that the mean can also be defined as \[\mu = \arg \min_x \sum_{i = 1}^{|D|}(x - d_i)^2\] Here \(D = (d_1, \dots, d_{|D|})\) is the set of images input to \(\mathrm{get\_prototype}\), and the mean need not be part of this set. For k-medoids, the prototype is defined as \[\mu = \arg \min_{x\in D} \sum_{i = 1}^{|D|}(x - d_i)^2\] In other words, it finds an element of \(D\) that minimizes the sum-squared distance. Include R code for your implementation of \(\mathrm{get\_prototype}\) for k-medoids. You can use as many for loops as you want, but the simplest is to loop over each observation assigned to that cluster, calculate the sum-squared distance when that observation is set as the prototype, and return the best choice.
Run k-medoids for \(K = 5, 10, 20\). Since this might take longer, you can use smaller values of \(N\) as well as fewer images (e.g. just 100 digits), but report what numbers you used. For each choice of \(K\), show the cluster prototypes. Comment on the quality of the cluster prototypes, as well as the value of the loss function vs k-means.
Solution:
# Credit to https://gist.github.com/brendano/39760
# Load the MNIST digit recognition dataset into R
load_image_file <- function(filename) {
  ret = list()
  f = file(filename, 'rb')
  readBin(f, 'integer', n = 1, size = 4, endian = 'big')  # magic number
  ret$n = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  nrow = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  ncol = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  x = readBin(f, 'integer', n = ret$n * nrow * ncol, size = 1, signed = F)
  ret$x = matrix(x, ncol = nrow * ncol, byrow = T)
  close(f)
  ret
}
load_label_file <- function(filename) {
  f = file(filename, 'rb')
  readBin(f, 'integer', n = 1, size = 4, endian = 'big')  # magic number
  n = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  y = readBin(f, 'integer', n = n, size = 1, signed = F)
  close(f)
  y
}
show_digit <- function(arr784, col = gray(12:1/12), ...) {
  image(matrix(arr784, nrow = 28)[, 28:1], col = col, axes = F, ...)
}
# load the first 1000 digits and their labels
digits.list <- load_image_file('./data/train-images-idx3-ubyte')
digits <- digits.list$x[1:1000,]
digits <- digits / 255.0 # normalize pixels
labels <- load_label_file('./data/train-labels-idx1-ubyte')
dim(digits)
## [1] 1000 784
# snippet to test the show_digit function
show_digit(digits[12,])
labels[12]
## [1] 5
class(labels[12])
## [1] "integer"
# define helper functions
# squared Euclidean distance between a single digit and every cluster centroid
# (centroids is a matrix with one centroid per row)
getDist <- function(single.digit, centroids){
  dist <- colSums((t(centroids) - single.digit)^2)
  return(dist)
}
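As a quick sanity check (with the function reproduced inline so the snippet runs standalone), \(\mathrm{getDist}\) should return the squared Euclidean distance from one observation to each centroid row:

```r
# getDist as defined above, reproduced so this snippet is self-contained
getDist <- function(single.digit, centroids){
  colSums((t(centroids) - single.digit)^2)
}
centroids <- rbind(c(0, 0, 0),
                   c(1, 1, 1))
d <- getDist(c(1, 0, 0), centroids)
d  # squared distances to each centroid: 1 and 2
```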
# update the centroid of one cluster: the mean of the digits assigned to it,
# or a random digit if the cluster is empty (ric is a 0/1 indicator vector)
single.centroid <- function(ric, all.digits){
  num.digits.in.cluster <- sum(ric)
  if (num.digits.in.cluster > 0){
    return((ric %*% all.digits) / num.digits.in.cluster)
  } else {
    return(all.digits[sample(nrow(all.digits), 1), ])
  }
}
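A small standalone check of the indicator-vector trick (the function is reproduced inline): \(\mathrm{ric} \mathbin{\%*\%} \mathrm{all.digits}\) sums exactly the rows with \(\mathrm{ric} = 1\), so dividing by the count gives the cluster mean:

```r
# single.centroid as above, reproduced so this snippet runs on its own
single.centroid <- function(ric, all.digits){
  n <- sum(ric)
  if (n > 0) return((ric %*% all.digits) / n)
  all.digits[sample(nrow(all.digits), 1), ]
}
X <- rbind(c(0, 0),
           c(2, 4),
           c(4, 8))
m <- single.centroid(c(0, 1, 1), X)
m  # mean of rows 2 and 3: (3, 6)
```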
# K-means algorithm starts here
# all.digits: observation matrix; K: number of clusters; N: number of random
# initializations; seed: for repeatability (the prompt allows it as a 4th argument)
my_kmeans <- function(all.digits, K, N, seed = 16){
  set.seed(seed)
  best.cluster.parameters <- NULL   # centroids of the best run
  best.cluster.assignments <- NULL  # one-hot assignment matrix (ric)
  best.loss.seq <- c()
  terminal.losses <- c()
  best.terminal.loss <- Inf
  for (i in seq(1, N)){
    # initialize centroids as K randomly chosen digits
    centroids <- all.digits[sample(nrow(all.digits), K), ]
    dists <- t(apply(all.digits, 1, getDist, centroids = centroids))
    loss <- sum(apply(dists, 1, min))
    last.loss <- loss + 1
    loss.seq <- c(loss)
    # stopping criterion: iterate until the loss decreases by less than 1e-4
    while (abs(last.loss - loss) >= 1e-4) {
      # one-hot assignment of each observation to its closest centroid
      # (which.min breaks ties by taking the first minimum, so each row has exactly one 1)
      rics <- t(apply(dists, 1, function(d) as.numeric(seq_along(d) == which.min(d))))
      centroids <- apply(rics, 2, single.centroid, all.digits = all.digits)
      dists <- t(apply(all.digits, 1, getDist, centroids = t(centroids)))
      last.loss <- loss
      loss <- sum(apply(dists, 1, min))
      loss.seq <- c(loss.seq, loss)
    }
    terminal.losses <- c(terminal.losses, loss)
    if (best.terminal.loss > loss){
      best.terminal.loss <- loss
      best.loss.seq <- loss.seq
      best.cluster.assignments <- rics
      best.cluster.parameters <- t(centroids)
    }
  }
  return(list(best.cluster.parameters = best.cluster.parameters,
              best.cluster.assignments = best.cluster.assignments,
              best.loss.seq = best.loss.seq,
              terminal.losses = terminal.losses))
}
K <- c(5,10,20)
N <- 15
results <- list()
for (k in K){
results <- append(results, list(my_kmeans(digits,k,N)))
}
* Case \(K = 5\)
for (id in seq(1,5)){
show_digit(results[[1]]$best.cluster.parameters[id,])
}
* Case \(K = 10\)
for (id in seq(1,10)){
show_digit(results[[2]]$best.cluster.parameters[id,])
}
* Case \(K = 20\)
for (id in seq(1,20)){
show_digit(results[[3]]$best.cluster.parameters[id,])
}
* Evolution for the best solution in case \(K = 5\)
library(ggplot2)
best.loss.seq.df <- data.frame(loss=results[[1]]$best.loss.seq,iter=seq(1,length(results[[1]]$best.loss.seq)))
ggplot(best.loss.seq.df,aes(x=iter,y=loss)) +
geom_line() + geom_point() + ggtitle(label = 'K = 5')
* Evolution for the best solution in case \(K = 10\)
best.loss.seq.df <- data.frame(loss=results[[2]]$best.loss.seq,iter=seq(1,length(results[[2]]$best.loss.seq)))
ggplot(best.loss.seq.df,aes(x=iter,y=loss)) +
geom_line() + geom_point() + ggtitle(label = 'K = 10')
* Evolution for the best solution in case \(K = 20\)
best.loss.seq.df <- data.frame(loss=results[[3]]$best.loss.seq,iter=seq(1,length(results[[3]]$best.loss.seq)))
ggplot(best.loss.seq.df,aes(x=iter,y=loss)) +
geom_line() + geom_point() + ggtitle(label = 'K = 20')
* Case \(K = 5\)
terminal.losses.df <- data.frame(terminal.losses=results[[1]]$terminal.losses)
ggplot(terminal.losses.df,aes(x=terminal.losses)) +
geom_density() + ggtitle(label = 'K = 5')
* Case \(K = 10\)
terminal.losses.df <- data.frame(terminal.losses=results[[2]]$terminal.losses)
ggplot(terminal.losses.df,aes(x=terminal.losses)) +
geom_density() + ggtitle(label = 'K = 10')
* Case \(K = 20\)
terminal.losses.df <- data.frame(terminal.losses=results[[3]]$terminal.losses)
ggplot(terminal.losses.df,aes(x=terminal.losses)) +
geom_density() + ggtitle(label = 'K = 20')
# get_prototype replaces single.centroid: it returns the medoid of the cluster,
# i.e. the assigned digit minimizing the sum of squared distances to the others
get_prototype <- function(ric, all.digits){
  # drop = FALSE keeps a matrix even when 0 or 1 digits are assigned
  digits.in.cluster <- all.digits[ric == 1, , drop = FALSE]
  # empty cluster: fall back to a random digit, just as in initialization
  if (nrow(digits.in.cluster) == 0){
    return(all.digits[sample(nrow(all.digits), 1), ])
  }
  smallest.dist <- Inf
  closest.digit <- digits.in.cluster[1, ]
  for (i in seq(1, nrow(digits.in.cluster))){
    digit <- digits.in.cluster[i, ]
    dist.to.others <- sum(getDist(digit, digits.in.cluster))
    if (dist.to.others < smallest.dist){
      closest.digit <- digit
      smallest.dist <- dist.to.others
    }
  }
  return(closest.digit)
}
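A standalone illustration of the medoid rule (with \(\mathrm{getDist}\) reproduced inline): unlike the mean, the prototype is forced to be one of the observations:

```r
# squared distance from one point to each row of a matrix (as getDist above)
getDist <- function(single.digit, centroids){
  colSums((t(centroids) - single.digit)^2)
}
D <- rbind(c(0, 0),
           c(1, 0),
           c(10, 0))
# medoid: the row of D minimizing the summed squared distance to all rows
sums <- apply(D, 1, function(d) sum(getDist(d, D)))
medoid <- D[which.min(sums), ]
colMeans(D)  # the mean (about 3.67, 0) is not an element of D
medoid       # the medoid is (1, 0), an actual observation
```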
# k-medoids: identical to my_kmeans, except that cluster prototypes are
# computed with get_prototype (a medoid) instead of single.centroid (a mean)
my_kmedoids <- function(all.digits, K, N, seed = 26){
  set.seed(seed)
  best.cluster.parameters <- NULL   # prototypes of the best run
  best.cluster.assignments <- NULL  # one-hot assignment matrix (ric)
  best.loss.seq <- c()
  terminal.losses <- c()
  best.terminal.loss <- Inf
  for (i in seq(1, N)){
    centroids <- all.digits[sample(nrow(all.digits), K), ]
    dists <- t(apply(all.digits, 1, getDist, centroids = centroids))
    loss <- sum(apply(dists, 1, min))
    last.loss <- loss + 1
    loss.seq <- c(loss)
    # same stopping criterion as my_kmeans: loss improvement below 1e-4
    while (abs(last.loss - loss) >= 1e-4) {
      rics <- t(apply(dists, 1, function(d) as.numeric(seq_along(d) == which.min(d))))
      centroids <- apply(rics, 2, get_prototype, all.digits = all.digits)
      dists <- t(apply(all.digits, 1, getDist, centroids = t(centroids)))
      last.loss <- loss
      loss <- sum(apply(dists, 1, min))
      loss.seq <- c(loss.seq, loss)
    }
    terminal.losses <- c(terminal.losses, loss)
    if (best.terminal.loss > loss){
      best.terminal.loss <- loss
      best.loss.seq <- loss.seq
      best.cluster.assignments <- rics
      best.cluster.parameters <- t(centroids)
    }
  }
  return(list(best.cluster.parameters = best.cluster.parameters,
              best.cluster.assignments = best.cluster.assignments,
              best.loss.seq = best.loss.seq,
              terminal.losses = terminal.losses))
}
# as permitted by the prompt, use only the first 100 digits and N = 10 runs
digits.smaller <- digits[1:100, ]
K <- c(5, 10, 20)
N <- 10
results <- list()
for (k in K){
  results <- append(results, list(my_kmedoids(digits.smaller, k, N)))
}
* Case \(K = 5\)
for (id in seq(1,5)){
show_digit(results[[1]]$best.cluster.parameters[id,])
}
* Case \(K = 10\)
for (id in seq(1,10)){
show_digit(results[[2]]$best.cluster.parameters[id,])
}
* Case \(K = 20\)
for (id in seq(1,20)){
show_digit(results[[3]]$best.cluster.parameters[id,])
}
* Evolution for the best solution in case \(K = 5\)
best.loss.seq.df <- data.frame(loss=results[[1]]$best.loss.seq,iter=seq(1,length(results[[1]]$best.loss.seq)))
ggplot(best.loss.seq.df,aes(x=iter,y=loss)) +
geom_line() + geom_point() + ggtitle(label = 'K = 5')
* Evolution for the best solution in case \(K = 10\)
best.loss.seq.df <- data.frame(loss=results[[2]]$best.loss.seq,iter=seq(1,length(results[[2]]$best.loss.seq)))
ggplot(best.loss.seq.df,aes(x=iter,y=loss)) +
geom_line() + geom_point() + ggtitle(label = 'K = 10')
* Evolution for the best solution in case \(K = 20\)
best.loss.seq.df <- data.frame(loss=results[[3]]$best.loss.seq,iter=seq(1,length(results[[3]]$best.loss.seq)))
ggplot(best.loss.seq.df,aes(x=iter,y=loss)) +
geom_line() + geom_point() + ggtitle(label = 'K = 20')
* distribution of terminal loss function values in case \(K = 5\)
terminal.losses.df <- data.frame(terminal.losses=results[[1]]$terminal.losses)
ggplot(terminal.losses.df,aes(x=terminal.losses)) +
geom_density() + ggtitle(label = 'K = 5')
* distribution of terminal loss function values in case \(K = 10\)
terminal.losses.df <- data.frame(terminal.losses=results[[2]]$terminal.losses)
ggplot(terminal.losses.df,aes(x=terminal.losses)) +
geom_density() + ggtitle(label = 'K = 10')
* distribution of terminal loss function values in case \(K = 20\)
terminal.losses.df <- data.frame(terminal.losses=results[[3]]$terminal.losses)
ggplot(terminal.losses.df,aes(x=terminal.losses)) +
geom_density() + ggtitle(label = 'K = 20')